JMIR Medical Informatics — Latest Matching Preprints

1

Sentiment in Clinical Notes: A Predictor for Length of Stay?

Boyne, A.; Feygin, M.; Sholeen, J.; Zimolzak, A.

2026-03-18 health informatics 10.64898/2026.03.16.26348553 medRxiv

Top 0.1%

13.8%

Background: Length of stay (LOS) is a critical metric for hospital operational efficiency. While structured clinical data is widely used to predict LOS, unstructured admission notes contain latent prognostic information regarding diagnostic uncertainty and disease complexity. This study evaluates the efficacy of extracting sentiment and direct LOS estimates from admission notes to predict patient hospitalization duration. Methods: We conducted a retrospective study of 4,503 adult patients admitted with community-acquired pneumonia between 2013 and 2023. Admission history and physical notes were preprocessed and filtered to extract physician-generated narratives. We evaluated four natural language processing models, VADER, TextBlob, Longformer, and an open-source large language model (GPT-oss-20B), to generate zero-shot sentiment scores. Additionally, GPT-oss-20B was prompted to directly estimate LOS. Model outputs were correlated with actual LOS using linear regression and Pearson correlation coefficients. Results: Sentiment models demonstrated statistically significant, albeit weak, correlations with actual LOS. Longformer achieved the highest variance explained among sentiment classifiers (R² = 0.019). Direct LOS estimation by the LLM outperformed sentiment-based approaches, demonstrating the strongest correlation with actual hospital duration (r = -0.218, p < 0.001). Model agreement was generally poor (ICC = 0.059), and computational time varied drastically, from 2.6 seconds per 100 notes (TextBlob) to over 370 seconds (GPT-oss-20B). Conclusion: Zero-shot sentiment analysis of clinical notes yields a small but measurable correlation with LOS, limited primarily by the objective, non-evaluative nature of clinical documentation. Direct LLM estimation of clinical outcomes outperforms emotional sentiment extraction.
Future predictive systems should integrate computationally efficient NLP models capable of capturing latent clinical complexity alongside established structured data variables.
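
For a sense of scale when comparing the reported R² and r values: in a linear regression with a single predictor, the variance explained equals the squared Pearson correlation. The sketch below is illustrative only and is not the authors' analysis code; the helper names and the toy sentiment/LOS numbers are assumptions made up for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared_simple_ols(x, y):
    """R^2 of the least-squares line y = a + b*x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: a sentiment score per note and the observed LOS in days.
toy_sentiment = [-0.4, 0.1, 0.3, -0.2, 0.5, 0.0]
toy_los_days = [6, 4, 3, 7, 2, 5]

toy_r = pearson_r(toy_sentiment, toy_los_days)
toy_r2 = r_squared_simple_ols(toy_sentiment, toy_los_days)

# Applying the same identity to the correlation reported in the abstract.
implied_r2 = (-0.218) ** 2
```

Under that single-predictor assumption, r = -0.218 corresponds to roughly 4.8% of LOS variance explained, against 1.9% for the best sentiment classifier.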

2

A customizable calculation tool for allocation of adrenal vein sampling in primary aldosteronism in diverse populations

Leung, A. A.; Przybojewski, S. J.; Klamrowski, M.; Caughlin, C. E.; Wright, C.; Pasieka, J. L.; Wu, V.-C.; Lin, Y.-H.; Tsai, R.; Chang, C.-C.; Hundemer, G. L.; King, J.; Austin, K.; Mellor, K.; Hu, L.; Low, J.; Burkart, J.; Kline, G. A.

2026-02-06 endocrinology 10.64898/2026.02.05.26345289 medRxiv

Top 0.1%

10.2%

Background: Primary aldosteronism (PA) screening is recommended, but disease prevalence exceeds the availability of adrenal vein sampling (AVS). Methods: An AVS optimal allocation tool for health systems was developed using administrative data and AVS registries from Calgary and Taiwan. Four easily definable phenotypes of PA based on an elevated aldosterone-renin ratio (ARR) and the presence or absence of hypokalemia or adrenal mass were identified, representing progressively severe PA and stepwise increasing rates of AVS-defined lateralization. Using supply-and-demand principles, a customizable, web-based tool was developed that considers PA referral volume, PA phenotype prevalence, maximum AVS available per year, AVS success rate, and desired rate of finding unilateral disease. Results: The most prevalent phenotype of PA was characterized by an elevated ARR and hypokalemia but no adrenal mass (41.9% [39.9-43.9]); hypokalemia and adrenal mass accounted for 15.6% [14.4-16.9] of cases. There was a progressive increase in AVS lateralization rate with increasing severity of phenotype observed in both the Calgary and Taiwan data, ranging from 20-39% in those with PA without hypokalemia or adrenal mass to 70-90% in those with hypokalemia and adrenal mass. After accounting for institution-specific lateralization rates and allowing for system-level differences in high- and low-volume PA referrals and high and low AVS availability, the customizable AVS allocation tool was able to generate individualized strategies ranging from restrictive (exclusive reservation of AVS for cases with hypokalemia and adrenal mass) to more inclusive (assigning a proportion of AVS allocation to less severe PA cases). Conclusions: An AVS allocation tool that uses common, simple, and globally available PA case data may assist in health system AVS program case allocation for maximum equity and wait-list control.

3

Gender-Specific Osteoporosis Risk Prediction Using Longitudinal Clinical Data and Machine Learning

Tripathy, S.; Saripalli, L.; Berry, K.; Jayasuriya, A. C.; Kaur, D.; Syed, F.

2026-02-17 orthopedics 10.64898/2026.02.13.26346244 medRxiv

Top 0.1%

10.0%

Osteoporosis is a silent yet debilitating disease that often remains undetected until fractures occur. While early prediction is crucial, most studies combine male and female datasets to train a single model, introducing bias since osteoporosis risk and progression differ by gender. This study aims to develop gender-specific machine learning models that leverage longitudinal data to predict osteoporosis risk, providing tailored insights for men and women. Data were obtained from two large longitudinal cohorts: the Study of Osteoporotic Fractures (SOF) for women and the Osteoporotic Fractures in Men Study (MrOS) for men. Multiple ML algorithms were trained and evaluated for each sex, with model performance assessed using the area under the receiver operating characteristic curve (AUC-ROC). Among the tested models, the XGBoost model demonstrated the best performance for women, achieving an AUC-ROC of 0.93 using SOF data. For men, the Random Forest model achieved an AUC-ROC of 0.89 using MrOS data. Feature importance analysis identified sex-specific osteoporosis risk factors, underscoring the need for tailored prediction and management. By revealing male and female risk factors and reducing bias from combined datasets, the work advances personalized care and supports earlier, effective clinical intervention to prevent fractures and improve health outcomes.

4

Extending the OMOP Common Data Model to Support Observational Peripheral Vascular Disease Research

Leese, P. J.; McIntee, T.; Browder, S. E.; Laivuori, M.; Alabi, O.; McGinigle, K. L.

2026-02-03 health informatics 10.64898/2026.02.01.26345276 medRxiv

Top 0.1%

9.9%

Background: Peripheral artery disease (PAD) and chronic limb-threatening ischemia (CLTI) cause substantial morbidity and mortality, yet research progress is limited by fragmented, non-standardized data. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) provides a standardized framework for electronic health record (EHR) research but lacks domain-specific detail for peripheral vascular diseases. This study aimed to develop and test a vascular-specific OMOP CDM extension to improve data standardization, enable reproducible real-world analyses, and support precision medicine research in PAD and CLTI. Methods: We identified patients with PAD, CLTI, or diabetic foot ulcers who sought care within the UNC Health System between April 2014 and July 2024. Standard OMOP tables were supplemented with peripheral vascular laboratory (PVL) data and state death records. Intermediate tables were designed for key clinical domains (e.g., smoking, comorbidities, revascularizations) to enhance reusability. Predictive models for revascularization and mortality were developed using logistic regression with Bayesian weighting and Markov Chain Monte Carlo feature selection. Clinical Application: The revascularization model displayed high performance with and without important vascular variables (AUC = 0.970 and AUC = 0.969, respectively), while the mortality model demonstrated moderate accuracy (AUC = 0.656) that improved with inclusion of vascular-specific features (AUC = 0.752). Conclusions: This vascular OMOP extension represents one of the first specialty-specific frameworks for peripheral vascular research. By extending the OMOP CDM to a vascular domain, this work advances both the technical framework and scientific capability of real-world data research in limb preservation and precision vascular medicine.

5

Class imbalance correction in artificial intelligence models leads to miscalibrated clinical predictions: a real-world evaluation

Roesler, M. W.; Wells, C.; Schamberg, G.; Gao, J.; Harrison, E.; O'Grady, G.; Varghese, C.

2026-03-05 health informatics 10.64898/2026.03.04.26347634 medRxiv

Top 0.1%

9.8%

Background: Predictive models employing machine learning algorithms are increasingly being used in clinical decision making, and improperly calibrated models can result in systematic harm. We sought to investigate the impact of class imbalance correction, a commonly applied preprocessing step in machine learning model development, on calibration and modelled clinical decision making in a large real-world context. Methods: A histogram boosted gradient classifier was trained on a highly imbalanced national dataset of >1.8 million patients undergoing surgery, to predict the risk of 90-day mortality and complications after surgery. Class imbalance correction strategies including random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random under-sampling (RUS), and cost-sensitive learning (CSL) were compared to the natural distribution (natural). Models were tested and compared with classification metrics, calibration plots, decision curve analysis, and simulated clinical impact analysis. Results: The natural model demonstrated high performance (AUROC 0.94, 95% CI 0.94-0.95 for mortality; 0.84, 95% CI 0.84-0.85 for complications) and calibration (log loss 0.05, 95% CI 0.04-0.05 for mortality; 0.23, 95% CI 0.23-0.24 for complications). Class imbalance mitigation (CSL, ROS, RUS, and SMOTE) did not improve AUROC or AUPRC but increased recall and F1 scores at the expense of precision and accuracy. However, these methods severely compromised model calibration, leading to significant over-prediction of risks (up to a 62.8% increase) as further evidenced by increased log loss across all mitigation techniques. Decision curve analysis and clinical scenario testing confirmed that the natural model provided the highest net benefit. Conclusion: Class imbalance correction methods result in significant miscalibration, leading to possible harm when used for clinical decision making.
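
Why resampling inflates predicted risk can be seen from a standard identity for random under-sampling: if non-event rows are kept with probability beta, the odds a well-fitted model learns are the natural odds divided by beta. The sketch below illustrates that identity only; it is not the study's pipeline, and the function names and the example values of `p` and `beta` are assumptions.

```python
def prob_after_undersampling(p, beta):
    """Probability learned on data where non-events are kept with probability beta.

    p    : event probability under the natural class distribution
    beta : fraction of majority-class (non-event) rows retained, 0 < beta <= 1
    """
    return p / (p + beta * (1.0 - p))

def correct_to_natural_scale(p_sampled, beta):
    """Invert the shift above to map a prediction back to the natural scale."""
    return beta * p_sampled / (beta * p_sampled + 1.0 - p_sampled)

# Hypothetical patient with a true 2% risk, model trained after keeping 10% of non-events.
true_risk = 0.02
keep_fraction = 0.10

inflated = prob_after_undersampling(true_risk, keep_fraction)
recovered = correct_to_natural_scale(inflated, keep_fraction)
```

In this toy case a 2% risk is reported as about 17%, which leaves the ranking of patients unchanged (so AUROC is unaffected) while the absolute probabilities become misleading for threshold-based decisions.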

6

Development and Temporal Evaluation of Multimodal Machine Learning Models to Predict High Inpatient Opioid Exposure

Kale, S.; Singh, D.; Truumees, E.; Geck, M.; Stokes, J.

2026-04-02 health informatics 10.64898/2026.03.31.26349842 medRxiv

Top 0.1%

8.3%

High inpatient opioid exposure is associated with increased risk of persistent opioid use. Early identification of high-risk patients may improve opioid stewardship. We developed machine learning models to predict high opioid exposure during hospitalization using electronic health record data from MIMIC-IV. We conducted a retrospective study of 223,452 unique first hospital admissions in MIMIC-IV. The outcome was high opioid exposure, defined as the top decile among opioid-exposed admissions (MME/day ≥ 225), representing 2.65% of all admissions. Structured early-admission features included demographics, admission characteristics, laboratory utilization and abnormality summaries, and 24-hour procedural indicators. Discharge-note data were incorporated using ClinicalBERT embeddings and interpretable bigram features. Models were trained using an 80/10/10 split and evaluated with temporal validation on the most recent 10% of admissions. Performance was assessed using ROC-AUC and PR-AUC with 95% confidence intervals. Among structured-only models, XGBoost achieved the best test performance (ROC-AUC 0.932 [0.924-0.940]; PR-AUC 0.223 [0.193-0.262]). The combined structured and notes model improved precision-recall performance (ROC-AUC 0.932 [0.920-0.943]; PR-AUC 0.276 [0.229-0.331]). Temporal evaluation showed similar discrimination (ROC-AUC 0.929; PR-AUC 0.223). High-risk bigrams included procedural terms such as "external fixation" and "cervical discectomy." Integration of structured and text-derived features improved risk stratification compared to structured data alone. Interpretable bigram signals reflected procedural complexity and orthopedic pathology, reinforcing the clinical plausibility of model predictions. Multimodal EHR-based models accurately predict high inpatient opioid exposure and may support targeted opioid stewardship during hospitalization.

7

AI-Powered Pipeline for Annotating Echocardiography Notes and Prognostic Variable Analysis in Critical Care

Xu, S.; Ma, T.; Duan, C.; IP, A.; Tam, C.; LEUNG, Y.; Yang, J.; SIN, S.; CHEUNG, E.; Yiu, K.-H.; Yeung, P.

2026-03-10 health informatics 10.64898/2026.03.09.26347835 medRxiv

Top 0.1%

8.3%

Background: Echocardiography (echo) notes contain valuable prognostic information for patients in the intensive care unit (ICU). However, their unstructured format and the presence of sensitive patient information present challenges for large-scale, automated analysis. There is a need for secure and efficient methods to extract and utilize echo data to enhance ICU outcome prediction. Methods: We developed an AI-powered, privacy-preserving pipeline that leverages advanced natural language processing and pattern matching to annotate echo notes locally, ensuring comprehensive masking of personally identifiable information. This pipeline was applied to patient data from a mixed medical-surgical ICU in a tertiary referral hospital. Key variables were extracted from echo notes and integrated with clinical and laboratory data to predict ICU mortality. A LightGBM machine learning model, which is robust to missing values, was trained using both routine and echo-derived structured clinical features. Its predictive performance was compared to that of the APACHE IV score. Results: Compared with the reference standard derived from manual annotation by an echocardiography specialist, automated annotation of echo notes achieved 98.85% data accuracy with a false positive rate of 0.31%. Several echo-derived variables, including left ventricular ejection fraction (LVEF), left ventricular outflow tract velocity time integral (LVOT VTI), tricuspid annular plane systolic excursion (TAPSE), mitral regurgitation (MR), and aortic regurgitation (AR), were strongly associated with ICU mortality. Incorporating echo-derived variables improved the accuracy of ICU mortality prediction, with the LightGBM model achieving an AUC of 0.902 compared to 0.861 for the APACHE IV score. Conclusion: Our locally deployable AI pipeline enables secure and automated extraction of prognostic information from echo notes, substantially enhancing ICU mortality prediction.
The inclusion of echo-derived variables significantly improved predictive accuracy, underscoring the potential but currently underutilized value of unstructured notes. This approach paves the way for scalable, privacy-preserving decision support tools in critical care.

8

DR. INFO at the Point of Care: A Prospective Pilot Study of an Agentic AI Clinical Assistant

Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.

2026-04-01 health informatics 10.64898/2026.03.31.26349817 medRxiv

Top 0.1%

7.2%

Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout. Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
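
For readers unfamiliar with the Net Promoter Score, the sketch below shows the conventional calculation: the percentage of promoters minus the percentage of detractors. The abstract does not state the rating scale or cut-offs used, so the 0-10 scale with promoters at 9-10 and detractors at 0-6 is an assumption, and the ratings are invented for illustration.

```python
def net_promoter_score(ratings):
    """Conventional NPS: percent promoters (9-10) minus percent detractors (0-6).

    Assumes the usual 0-10 "how likely are you to recommend" scale.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)

# Hypothetical responses: six promoters, two passives (7-8), no detractors.
example_ratings = [10, 9, 9, 8, 10, 7, 9, 10]
example_nps = net_promoter_score(example_ratings)
```

With no detractors, as in this toy example, the score reduces to the share of respondents who are promoters.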

9

Development and Validation of the Intensive Documentation Index for ICU Mortality Prediction: A Temporal Validation Study

Collier, A.

2026-03-18 health informatics 10.64898/2026.02.10.26345827 medRxiv

Top 0.1%

6.7%

Background: Nursing documentation patterns may reflect patient acuity and clinical deterioration, yet their prognostic value remains underexplored. We developed the Intensive Documentation Index (IDI), a novel framework quantifying temporal documentation rhythms, and evaluated its ability to enhance ICU mortality prediction. Methods: We analyzed 26,153 ICU admissions of heart failure patients from the MIMIC-IV database (2008-2019). Nine IDI features capturing documentation rhythm, volume, and surveillance gaps were extracted from electronic health record timestamps during the first 24 hours of ICU stay. We compared logistic regression models with and without IDI features using temporal validation and race-stratified analysis. Results: The cohort had a mean age of 68.5 ± 13.2 years and an in-hospital mortality rate of 15.99% (n=4,181). The baseline model (age, sex, ICU length of stay) achieved an AUC of 0.658 (95% CI 0.609-0.710). Addition of nine IDI features significantly improved discrimination to 0.683 (95% CI 0.631-0.732), an absolute increase of 0.025 (p<0.05, DeLong test). Leave-one-year-out cross-validation across 12 years yielded a mean AUC of 0.684 (SD 0.008). The coefficient of variation of inter-event intervals (idi_cv_interevent) was the strongest predictor (OR 1.53 per SD, 95% CI 1.38-1.70, p<0.001). Model performance was consistent across racial and ethnic groups (AUC range 0.673-0.691), with no evidence of systematic bias. Conclusions: Documentation rhythm patterns, captured through the IDI framework, significantly enhance ICU mortality prediction beyond traditional clinical variables. The association between documentation irregularity and mortality may reflect nursing workload, patient acuity, or care processes warranting further investigation. IDI represents a novel, readily available prognostic signal that could inform future clinical decision support systems.
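
The strongest predictor named above is a coefficient of variation of inter-event intervals. One plausible way to compute such a feature from documentation timestamps is sketched below; the abstract does not give the exact definition, so the use of the population standard deviation, the minute-based timestamps, and the function name `interevent_cv` are all assumptions.

```python
from statistics import mean, pstdev

def interevent_cv(timestamps):
    """Coefficient of variation (std / mean) of gaps between consecutive events.

    timestamps: event times in any consistent unit (here, minutes since ICU admission).
    Uses the population standard deviation, which is an assumption of this sketch.
    """
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events to measure variability")
    average_gap = mean(gaps)
    if average_gap == 0:
        raise ValueError("all events share the same timestamp")
    return pstdev(gaps) / average_gap

# Hypothetical charting patterns over the first hours of an ICU stay.
regular_charting = [0, 60, 120, 180, 240]   # one entry every hour
bursty_charting = [0, 5, 10, 200, 240]      # clustered entries, then a long gap

cv_regular = interevent_cv(regular_charting)
cv_bursty = interevent_cv(bursty_charting)
```

A perfectly regular rhythm gives a value of zero, while clustered entries separated by long gaps push the value above one.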

10

Improving Medicare Fraud Detection Accuracy in Deep Learning by Exploring Feature Selection and Data Sampling Techniques.

Ahammed, F.

2026-03-20 health informatics 10.64898/2026.03.18.26348763 medRxiv

Top 0.1%

6.5%

Healthcare fraud is a growing problem with far-reaching consequences: it burdens the financial stability of the health industry and threatens the quality of medical care. It arises from vulnerabilities in the current healthcare framework that fraudsters exploit. Although many models have been developed to detect fraudulent patterns in insurance claims, their accuracy frequently suffers from class imbalance in the Medicare dataset and from irrelevant features. This study seeks to improve detection performance and accuracy by combining a deep learning model with data sampling and feature selection techniques. A comparative analysis of different combinations is conducted to determine how effectively each improves the accuracy of the fraud detection model. The results indicate that combining data sampling with feature selection improves accuracy and performance. Using Chi-square feature selection together with the Synthetic Minority Over-sampling Technique (SMOTE), the model reached an accuracy of 95.4% with negligible evidence of overfitting. Ultimately, the findings underscore the value of combined techniques over the baseline deep learning model alone for detecting Medicare insurance fraud.

11

Context-Aware Emergency Department Triage Using Pairwise Comparisons and Bradley-Terry Aggregation

Jarrett, P.; Reeder, J.; McDonald, S.; Diercks, D.; Jamieson, A. R.

2026-03-17 health informatics 10.64898/2026.03.14.26348412 medRxiv

Top 0.1%

6.4%

Objective: To evaluate a ranking approach for emergency department (ED) waiting room prioritization that uses pairwise clinical comparisons aggregated via a Bradley-Terry model, and to assess its cross-site stability without site-specific training. Materials and Methods: Using the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset (118,385 ED visits, Site A), we defined a composite deterioration outcome (intensive care unit [ICU] admission, intubation, vasopressor, ventilation, or death within 6 hours) and evaluated 7 queue-ordering policies across 1,000 simulated shifts. The primary endpoint was Recall@5 for deteriorators; secondary endpoints included area under the receiver operating characteristic curve (AUROC) and simulated time-to-provider (TTP) metrics. External validation used MIMIC-IV-ED (425,087 visits, Site B) with 500 shifts. Methods reported per TRIPOD-LLM. Results: On MC-MED, BT-LLM-Enriched (Bradley-Terry ranking with a large language model [LLM] judge, GPT-4.1, using full diagnoses and medications) exceeded the Emergency Severity Index (ESI) on the primary endpoint: Recall@5 0.587 vs. 0.491 (p<0.001). XGBoost achieved Recall@5 0.648 but required large site-specific labeled training data. On external validation, supervised model performance attenuated (XGBoost AUROC 0.892 to 0.807) while BT-LLM-Enriched remained stable (0.826 to 0.831); the two were statistically indistinguishable on external data. Discussion: Under external validation, supervised model performance attenuated while zero-shot LLM ranking remained stable, suggesting cross-site stability without requiring site-specific training data. Conclusion: Pairwise ranking with an LLM judge significantly outperforms ESI-based ordering and remains stable across sites without local training, matching supervised models on external data.
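
To make the aggregation step concrete, the sketch below fits a Bradley-Terry model to a small table of pairwise "more urgent than" counts using the classic minorization-maximization update. It is a generic illustration rather than the study's implementation: the pairwise judgments there came from an LLM judge, whereas the counts, patient indices, and function name here are invented for the example.

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise comparison counts.

    wins[i][j] = number of times item i was judged more urgent than item j.
    Returns strengths normalized to sum to 1 (higher means ranked earlier).
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (sum of current strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [value / norm for value in updated]
    return strengths

# Hypothetical judgments for three waiting patients, four comparisons per pair.
pairwise_wins = [
    [0, 3, 3],  # patient 0 judged more urgent than patient 1 three times, patient 2 three times
    [1, 0, 3],
    [1, 1, 0],
]

fitted = bradley_terry(pairwise_wins)
queue_order = sorted(range(len(fitted)), key=lambda i: fitted[i], reverse=True)
```

Because the fitted strengths live on a single scale, the queue can be ordered even when individual pairwise judgments disagree with one another.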

12

Electronic Health Record-Based Estimation of Kansas City Cardiomyopathy Questionnaire Scores in Heart Failure

Kim, Y. W.; Lau, W.; Patel, N.; Kendrick, K.; Wu, A.; Feldman, T.; Ahern, R.; Oka, A.

2026-04-05 health informatics 10.64898/2026.04.03.26350138 medRxiv

Top 0.1%

6.4%

Background: The Kansas City Cardiomyopathy Questionnaire (KCCQ) is a validated patient-reported outcome measure for heart failure. However, its clinical utility is limited by incomplete and inconsistent data collection. We aimed to develop and validate machine learning models to estimate KCCQ overall summary scores from electronic health record (EHR) data. Methods: We assembled a retrospective cohort of 10,889 heart failure patients with recorded KCCQ scores from the Truveta database. Predictor features were derived from structured EHR variables across 13 historical time windows (15-360 days). Multiple regression algorithms were evaluated, followed by SHapley Additive exPlanations (SHAP)-based feature reduction and nested cross-validation for hyperparameter optimization. Model performance was assessed using the coefficient of determination (R²), mean absolute error (MAE), and ordinal discrimination and calibration for categorical severity classification. Results: Histogram-based gradient boosting (HGB) with HGB-SHAP feature selection achieved the strongest performance, reducing feature dimensionality by more than 94% while maintaining estimation accuracy. The 240-day window performed best (R² = 0.522, MAE = 12.485). For categorical severity classification, the model demonstrated strong ordinal discrimination (mean ordinal AUROC = 0.850). Quantile-based calibration improved classification balance, increasing the F1-score for the most severe category (KCCQ < 25) from 0.180 to 0.428 and the quadratic weighted kappa from 0.601 to 0.640. Longer EHR observation windows were associated with improved prediction performance. Conclusion: Machine learning models can estimate KCCQ scores from routine EHR data with clinically meaningful accuracy and strong discriminatory performance.
This approach may help extend assessment of patient-reported health status to populations in which survey-based data are incompletely captured, supporting population-level cardiovascular outcomes assessment and risk stratification in heart failure care.

13

Augmenting Electronic Health Records for Adverse Event Detection

Kaynar, G.; You, Z.; Boyce, R. D.; Yakoh, T.; Kingsford, C.

2026-02-11 health informatics 10.64898/2026.02.10.26345962 medRxiv

Top 0.1%

6.4%

Objective: Adverse events (AEs) resulting from medical interventions are significant contributors to patient morbidity, mortality, and healthcare costs. Prediction of these events using electronic health records (EHRs) can facilitate timely clinical interventions. However, effective prediction remains challenging due to severe class imbalance, missing labels, and the complexity of EHR records. Classical machine learning approaches frequently underperform due to insufficient representation of minority adverse event classes and limited capacity to capture interactions among patient demographics, administered medications, and associated complications. Methods: We introduce TASER-AE, a novel data augmentation pipeline tailored for structured EHR data, coupled with transformer-based classification. TASER-AE addresses these issues through an NLP-inspired data augmentation framework adapted for EHR, enabling effective minority-class representation in sparse and imbalanced clinical datasets. The augmented records produced by TASER-AE alleviate class imbalance by enriching the representation of minority adverse event classes, which enhances the robustness and predictive performance of the classifier. Results: TASER-AE yields minority-class F1 scores up to 0.70, substantially surpassing classical machine-learning baselines and prior augmentation methods across multiple adverse event tasks. Experiments conducted on two distinct EHR datasets confirm TASER-AE's ability to substantially improve adverse event detection performance. Conclusion: These results demonstrate the potential of structured, NLP-inspired augmentation methods to overcome data limitations in clinical predictive modeling, ultimately contributing to improved patient safety outcomes. TASER-AE is available at https://github.com/Kingsford-Group/taserae.

14

Development and validation of an algorithm to identify front-line clinicians using EHR audit log data

Baratta, L. R.; Wang, J.; Osweiler, B. W.; Lew, D.; Eiden, E.; Kannampallil, T. G.; Lou, S. S.

2026-02-16 health informatics 10.64898/2026.02.13.26346268 medRxiv

Top 0.1%

6.4%

Background: Interprofessional teams are central to high-quality patient care. However, identifying the clinician primarily responsible for a patient requires labor-intensive methodologies. Although electronic health record (EHR) audit logs offer a scalable alternative, their use for identifying frontline clinicians is underdeveloped. Objective: To develop and validate an algorithm utilizing EHR audit logs to identify the primary frontline clinician per patient-day of an encounter and to describe care continuity patterns. Method: This was a cross-sectional cohort study of adult inpatient medicine encounters at 12 hospitals in a single health system using a shared EHR. Admissions from February 1, 2023, to April 30, 2023, with length of stay of at least 3 days and without an intensive care unit admission were included. Four algorithm iterations were designed to identify the attending physician, resident, or advanced practice provider (APP) primarily responsible for patient care on each patient-day. Performance of each algorithm was compared with manual chart review on 1,401 patient-days from 246 randomly sampled patient encounters. Accuracy between an algorithm and the chart review standard was compared using McNemar's test with Bonferroni-adjusted p-values. Results: The best-performing algorithm correctly identified the primary clinician responsible for patient care on 91% of patient-days (1,268/1,401), outperforming the naive approach using frequency of actions (78% accuracy, 1,098/1,401, p<0.001). Algorithm errors were attributable to misidentified specialty and ambiguity on days with transitions of care or shared responsibilities between clinicians. The best-performing algorithm was applied to the entire cohort (5,801 encounters and 34,001 patient-days), where it identified attending physicians, resident physicians, and APPs as the frontline clinician for 26,750 (79%), 3,106 (9%), and 4,145 (12%) of patient-days, respectively.
Each encounter had a median of 1 (IQR 0-2) handoff between frontline clinicians. Conclusions: We developed a scalable, audit log-based algorithm to determine the frontline clinician with excellent accuracy compared with manual chart review.

15

Artificial Intelligence in Healthcare: 2025 Year in Review

Edara, R.; Khare, A.; Atreja, A.; Awasthi, R.; Highum, B.; Hakimzadeh, N.; Ramachandran, S. P.; Mishra, S.; Mahapatra, D.; Shree, S.; Bhattacharyya, A.; Singh, N.; Reddy, S.; Cywinski, J. B.; Khanna, A. K.; Maheshwari, K.; Papay, F. A.; Mathur, P.

2026-02-28 health informatics 10.64898/2026.02.23.26346888 medRxiv

Top 0.1%

6.3%

Background: Breakthroughs in model architecture and the availability of data are driving transformational artificial intelligence in healthcare research at an exponential rate. The shift in use of model types can be attributed to multimodal properties of the Foundation Models, better reflecting the inherently diverse nature of clinical data and the advancing model implementation capabilities. Overall, the field is maturing from exploratory development towards application in real-world evaluation and implementation, spanning both generative and predictive AI. Methods: A database search in PubMed was performed using the terms "machine learning" or "artificial intelligence" and "2025", with the search restricted to English-language human-subject research. A BERT-based deep learning classifier, pre-trained and validated on manually labeled data, assessed publication maturity. Five reviewers then manually annotated publications for healthcare specialty, data type, and model type. Systematic reviews, duplicates, pre-prints, robotic surgery studies, and non-human research publications were excluded. Publications employing foundation models were further analyzed for their areas of application and use cases. Results: The PubMed search yielded 49,394 publications, a near-doubling from 28,180 in 2024, of which 3,366 were classified as mature. After exclusions, 2,966 were included in the final analysis, compared to 1,946 in 2024. Imaging remained the dominant specialty (976 publications), followed by Administrative (277) and General (251). Traditional text-based LLMs (1,019) led model usage, but Multimodal Foundation Models surged from 25 publications in 2024 to 144 in 2025, and Deep Learning models also increased substantially (910). For the first time in our annual review, publications related to classical Machine Learning model use declined (173).
Image remained the predominant data type (53.9%), followed by text (38.2%), with a notable increase in audio (1.2%) coinciding with the adoption of multimodal models. Across foundation model publications, Imaging (110), Head and Neck (92), Surgery (64), Oncology (55), and Ophthalmology (49) were leading specialties, while Administrative and Education categories remained high-volume contributors driven predominantly by LLM-based research. Conclusion: 2025 signals a meaningful maturation of the healthcare AI research field, with publication volumes nearly doubling, classical ML yielding to higher-capacity foundation models, and the field rapidly moving beyond traditional text-based LLM capabilities toward multimodal models. While Imaging continues to lead in research output, the growth of multimodal models across clinical specialties suggests the field is approaching an inflection point where AI systems can more closely mirror the complexity of real-world clinical practice.

16

Falsification Testing of Sepsis Prediction Models: Evaluating Independent Biological Signal After Controlling for Care-Process Intensity

Dickens, A. R.

2026-03-18 health informatics 10.64898/2026.03.17.26348414 medRxiv

Top 0.1%

6.3%

Background: Automated sepsis early-warning systems have attracted substantial research investment, yet a fundamental question remains unresolved: do these models detect independent biological signals, or do they predominantly learn care-process intensity, that is, the pattern of clinician ordering behavior applied to patients already suspected of being ill? We report a pre-registered falsification study testing this hypothesis across four independent clinical datasets. Methods: A four-phase falsification framework with pre-specified thresholds was registered on OSF (March 11, 2026) before any data access. The primary confirmatory analysis used MIMIC-IV v3.1 (n=65,241 adult ICU stays, Beth Israel Deaconess Medical Center, 2008-2022). Exploratory replication analyses used eICU-CRD v2.0 (n=136,864, 208 US hospitals), MIMIC-III v1.4 (n=44,091), and the PhysioNet/CinC 2019 Sepsis Challenge (n=40,314). Each phase tested a distinct falsification criterion: (1) concordance across Sepsis-2, Sepsis-3, and CMS SEP-1 definitions; (2) model performance degradation when care-intensity proxy features are removed; (3) predictive performance of care-intensity features alone; and (4) discriminability of synthetic records generated to match care-intensity distributions. Results: The pre-registered primary analysis (MIMIC-IV) did not confirm the hypothesis (0/4 phases confirmed). Biological features predicted Sepsis-3 labels with AUROC 0.901 (95% CI 0.899-0.904); removing care-intensity features reduced performance by only 0.003 AUROC (drop=0.0027). The pre-specified Phase 3 threshold (care-only AUROC >0.70) was not met by the primary logistic regression model (AUROC 0.660); however, a sensitivity XGBoost model did exceed the threshold (AUROC 0.729), suggesting nonlinear care-intensity signal. However, a clinically significant finding emerged consistently across all four datasets: mean pairwise Jaccard similarity between clinical sepsis definitions and administrative coding (CMS SEP-1) was approximately 0.32 at the primary site and 0.20 across multi-center cohorts, indicating that hospital quality metrics and regulatory reporting systematically measure a different patient population than clinical definitions identify. Exploratory analyses revealed a detectable care-intensity signal in the eICU multi-center cohort (AUC drop=0.076) not present at the single academic center. Conclusions: At an elite academic medical center, sepsis prediction models detect genuine biological signal. Care-process leakage is not the primary driver of model performance in MIMIC-IV. The more consequential and robust finding is the systematic divergence between clinical and administrative sepsis definitions across all datasets examined, which has direct implications for regulatory reporting, pay-for-performance metrics, and the validity of AI benchmarks built on administrative data.
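
The definition-overlap measure reported in this abstract can be illustrated with a minimal Python sketch. It assumes each sepsis definition is represented as a set of ICU stay identifiers and averages the Jaccard similarity of each clinical definition against the administrative (CMS SEP-1) set. The function names, the toy identifiers, and the exact pairing scheme are illustrative assumptions, not the study's code.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union.

    Defined here as 0.0 when both sets are empty.
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_jaccard_vs_reference(clinical_sets, reference_set):
    """Average Jaccard similarity of each clinical cohort against one reference cohort."""
    scores = [jaccard(s, reference_set) for s in clinical_sets.values()]
    return sum(scores) / len(scores)


# Hypothetical toy stay IDs (NOT data from the study).
clinical = {
    "sepsis2": {1, 2, 3, 4, 5, 6},
    "sepsis3": {2, 3, 4, 5},
}
cms_sep1 = {4, 5, 7, 8}

overlap = mean_jaccard_vs_reference(clinical, cms_sep1)
print(f"mean Jaccard vs CMS SEP-1: {overlap:.3f}")
```

A value near 1 would mean the definitions select nearly the same stays; values such as the 0.32 and 0.20 quoted in the abstract mean that most stays flagged by one definition are not flagged by the other.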

17

Nationwide Prediction of Missed and Cancelled Appointments Using Real-World EHR Data

Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.

2026-04-13 health informatics 10.64898/2026.04.08.26349942 medRxiv

Top 0.2%

4.9%

Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data. Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management. Article Summary: Strengths and limitations of this study
- This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments.
- Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance.
- An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability.
- The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding.
- The database did not distinguish clearly between cancelled appointments and no-shows.

18

Can Machine Learning Algorithms Use Contextual Factors to Detect Unwarranted Clinical Variation from Electronic Health Record Encounter Data during the Treatment of Children Diagnosed with Acute Viral Pharyngitis?

Mcowiti, A. O.; Neaimeh, Y. R.; Gu, J.; Lalani, Y.; Newsome, T. C.; Nguyen, Y. H.; Shrager, S.; Rasmy, L. O.; Fenton, S. H.

2026-03-02 health informatics 10.64898/2026.02.23.26346757 medRxiv

Top 0.2%

4.9%

Rationale, Aims and Objectives: Unwarranted clinical variation (UCV) in patient care often arises from contextual factors and contributes to increased costs, unnecessary treatments, and deviations from evidence-based practice. Detecting UCV is challenging due to the complexity of care decisions. Current approaches rely on centralized data aggregation and mixed-effects regression, which estimate relative variation but cannot detect absolute variation. Moreover, machine learning (ML) methods leveraging contextual factors for UCV detection are lacking. The objective is to demonstrate the feasibility of ML for identifying absolute UCV using contextual features extracted from electronic health records (EHR) and identify the factors correlated with UCV in treating acute viral pharyngitis in children. Methods: We conducted a retrospective study of pediatric ambulatory visits (ICD-10 J02.8) at an academic health system. The use case focused on unwarranted antibiotic prescriptions for acute viral pharyngitis. We trained ensemble ML models (Random Forest, CatBoost, and Explainable Boosting Machine (EBM)) using encounter-level EHR data. Performance was evaluated using nested cross-validation and AUC metrics. We also compared CatBoost models trained on curated (gold-standard) versus weak labels. Results: All three ML models demonstrated robust performance, with a median AUC of 0.91, using data from 24 clinics, 81 providers, and 122 patients within an academic health system. CatBoost models trained on weak labels exhibited performance comparable to those trained on gold-standard labels. Feature importance analysis indicated that site-level and provider-level case volumes were the most influential predictors, followed by provider credential, years of experience, and encounter type. Notably, lower provider case volumes were associated with a reduced likelihood of inappropriate treatment. Conclusions: Classical ML models can effectively detect absolute UCV using contextual EHR features. Explainable models such as EBM offer interpretability critical for clinical adoption. These findings support ML-based approaches as scalable alternatives to traditional statistical methods for UCV detection without requiring centralized data analysis.

19

A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv

Top 0.2%

4.7%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge. This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
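
The share and disparity figures quoted in this abstract follow directly from the three document counts it reports. The short Python check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Counts quoted in the abstract.
total_documents = 2048    # deduplicated Scopus + PubMed/MEDLINE corpus
kg_gnn_documents = 17     # documents containing KG/GNN terms
xai_documents = 906       # documents containing XAI method terms

kg_share_pct = 100 * kg_gnn_documents / total_documents
disparity = xai_documents / kg_gnn_documents

print(f"KG/GNN share: {kg_share_pct:.2f}%")       # prints "KG/GNN share: 0.83%"
print(f"XAI : KG/GNN = {disparity:.1f}:1")        # prints "XAI : KG/GNN = 53.3:1"
```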

20

Validated Synthetic Data Generation from a Multicenter Spine Surgery Registry: Methodology and Benchmark

Challier, V.; Jacquemin, C.; Diebo, B.; Dehouche, N.; Denisov, A.; Cristini, J.; Campana, M.; Castelain, J.-E.; Lonjon, G.; Lafage, V.; Ghailane, S.; SpineDAO Collaborative Group

2026-04-11 health informatics 10.64898/2026.04.07.26350316 medRxiv

Top 0.2%

4.5%

Background: Synthetic data have emerged as a complementary strategy for secondary use of clinical registries, enabling data sharing without patient-level exposure. In spine surgery, multicenter data sharing is constrained by institutional governance and patient privacy regulations. Validated synthetic data generation may enable broader access to surgical outcomes data for artificial intelligence development without compromising patient confidentiality. Objective: To describe and benchmark a three-domain validated synthetic data pipeline applied to a multicenter, tokenized spine surgery registry (SpineBase), and to establish a reproducible certification framework for synthetic spine surgery datasets. Methods: We extracted 125 sacroiliac joint fusion cases from the SpineBase registry (SIBONE study, IRB-SOFCOT approval Ref. 14-2025; CNIL MR-004 Ref. 2234503 v 0). A GaussianCopula generative model was trained on 52 structured variables spanning demographics, preoperative assessments, operative details, and longitudinal outcomes at 3, 6, 12, and 24 months. Synthetic datasets of 100, 1,000, and 10,000 patients were generated. Validation followed a three-domain framework: (1) fidelity, assessed by Kolmogorov-Smirnov tests and Jensen-Shannon divergence; (2) utility, assessed by train-on-synthetic, test-on-real (TSTR) methodology; and (3) privacy, assessed by nearest-neighbor distance ratio (NNDR), membership inference attack, and k-anonymity proxy. Results: All three validation gates passed. Fidelity: mean KS p-value 0.52 (threshold >0.05). Privacy: NNDR >1.0 in 98.9% of synthetic records; membership inference AUROC 0.57. Utility: 12-month Oswestry Disability Index prediction yielded Pearson r = 0.29, consistent with expected attenuation at N = 125. A SHA-256 cryptographic hash of each certified dataset was anchored on the Solana blockchain for immutable provenance. Conclusions: A validated, blockchain-anchored synthetic data pipeline for spine surgery registries is technically feasible and meets current publication-standard criteria for fidelity and privacy. Utility metrics scale with registry size, creating a direct incentive for multicenter data contribution. This framework provides a reproducible methodology for synthetic data certification in spine surgery research, and establishes certified synthetic datasets as a privacy-native substrate for expert-annotation pipelines, as demonstrated in the companion Spine Reviews study.
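
As a rough illustration of the fidelity check named in this abstract, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a single variable in plain Python. It is a minimal sketch under stated assumptions: the function name and toy values are hypothetical, only the statistic is computed, and the p-values the abstract reports would normally come from a statistics library rather than from this code.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real, synthetic = sorted(real), sorted(synthetic)
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")

    def ecdf(sample, x):
        # Fraction of the (sorted) sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    # Both empirical CDFs are step functions that change only at observed
    # values, so evaluating the gap at those values finds the maximum.
    return max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in real + synthetic)


# Hypothetical toy values for one variable (NOT registry data).
real_scores = [22, 30, 34, 40, 48]
synthetic_scores = [24, 29, 35, 41, 47]
print(ks_statistic(real_scores, synthetic_scores))
```

A statistic near 0 indicates that the synthetic and real distributions of that variable are close; a statistic of 1 means the two samples do not overlap at all.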